Supermarket Analysis


Contact :

LinkedIn | WhatsApp

Introduction

Most of us have been to a supermarket, whether to buy daily necessities, supplies for a trip, and so on. A supermarket is a store that sells a wide variety of goods; the word comes from English and literally means "big market". Because supermarkets sell so many different items, they generate complex data, and we need special tools such as the R programming language to analyze that data effectively and efficiently. By doing so, we hope to obtain insights that are useful for decision making.

Background

We realize that in the current technological era, we need added value such as the ability to process data effectively and efficiently. Out of this awareness and my interest in learning new things such as the R programming language, I built this project. It also serves as practice after learning these coding skills.

Purpose

  1. This project aims to improve the writer’s understanding of and skills in the R programming language.

  2. It also serves as a means for the writer to build a portfolio, which is much needed in the professional world.

  3. Moreover, I hope it broadens the knowledge and understanding of each reader.

Business Question

The number of supermarkets in big cities grows every year, and market competition is high. As the data team, we are asked to analyze the performance of our supermarkets and provide recommendations. The data we will use contains transactions from the January - March 2019 period.

Data Introduction

On this occasion I analyze supermarket data provided on the Kaggle platform; the dataset consists of 1000 rows and 17 columns. I first prepare and wrangle the data, then explore and analyze it, and finally draw conclusions.

Our data comes from 3 cities in Myanmar. We can see the locations of these cities on the following map. Moreover, when we click a city icon, we see an iconic image of that city.

library(leaflet)

# get icon
ico <- makeIcon(
    iconUrl = "http://1.bp.blogspot.com/-gR-c6n8olqM/UBnRAyS_m-I/AAAAAAAAAQ8/nmOtsS54nUo/s1600/gambar-bendera-myanmar.JPG",
    iconWidth=750/40, iconHeight=499/40)

Latitude = c(17.02460428627904, 19.766060069490056, 21.954434407595635)
Longitude = c(96.22689582916921, 96.08171350599085, 96.08600825781728)

pts <- data.frame(Latitude, Longitude)

file <- "https://www.abouttimemagazine.co.uk/wp-content/uploads/2016/05/Shwedagon.jpg"
file1 <- "https://www.remotelands.com/remotenew1/dist/images/country/myanmar/city/20061701/b200617001.jpg"
file2 <- "https://1.bp.blogspot.com/-Wx6ZIQMlaIA/WRMT5q1V2mI/AAAAAAAADKA/s545x4vTB_ECUS0zyins68dYSx_g9UbUACLcB/s1600/Mandalay.jpg"

leaflet() %>%
  addTiles() %>%
  addMarkers(data = pts, lng =~Longitude, lat =~Latitude, icon=ico,
                   popup = c(paste0("Yangon City","<img src = ", file, ">"),
                             paste0("Naypyitaw City","<img src = ", file1, ">"),
                             paste0("Mandalay City","<img src = ", file2, ">")))

Data Preparation and Wrangling

Library

# ================== Wrangling Data =====================
library(lubridate)
library(tidyr)

# ===================== Scales ===================
library(scales)

# ==================== Visualization =====================
library(glue) # setting tooltip
library(ggplot2)
library(hrbrthemes)
library(plotly)


Data Preparation

supermarket <- read.csv("data_input/supermarket.csv") # read dataset
head(supermarket, 5) # Showing the top 5

Column description:

  • Invoice_ID: Computer generated sales slip invoice identification number
  • Branch: Branch of supermarket (3 branches are available identified by A, B and C).
  • City: Location of supermarket
  • Customer_type: Type of customer, recorded as Member for customers using a member card and Normal for those without one.
  • Gender: Gender type of customer
  • Product_line: General item categorization groups - Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel
  • Unit_price: Price of each product in $
  • Quantity: Number of products purchased by customer
  • Tax_5.: 5% tax fee on the customer's purchase
  • Total: Total price including tax
  • Date: Date of purchase (Record available from January 2019 to March 2019)
  • Time: Purchase time (10am to 9pm)
  • Payment: Payment used by customer for purchase (3 methods are available – Cash, Credit card and Ewallet)
  • cogs: Cost of goods sold
  • gross_margin_percentage: Gross margin percentage
  • gross_income: gross income from product sales
  • Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10)


Data Wrangling

Checking Missing Value

anyNA(supermarket)
#> [1] FALSE

The data has no missing values, so we can go directly to the next stage.


Checking Data Duplication

anyDuplicated(supermarket)
#> [1] 0

The data has no duplicated rows, so we can go directly to the next stage.


Deleting Column Tax_5.

data.frame(supermarket$Tax_5., supermarket$gross_income)[1:7,]

We will delete the Tax_5. column because it appears to be redundant: it has exactly the same values as the gross_income column, so it is better to remove it.
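Before dropping the column, we can confirm the redundancy programmatically. The sketch below uses a few toy values standing in for the real columns; the real check would compare supermarket$Tax_5. with supermarket$gross_income.

```r
# Sketch: confirm two columns are identical before dropping one.
# Toy values stand in for the real Tax_5. and gross_income columns.
toy <- data.frame(Tax_5. = c(26.14, 3.82, 16.22),
                  gross_income = c(26.14, 3.82, 16.22))

# all.equal() tolerates tiny floating-point differences
is_redundant <- isTRUE(all.equal(toy$Tax_5., toy$gross_income))

# Only drop the column when the check passes
if (is_redundant) toy$Tax_. <- NULL
if (is_redundant) toy$Tax_5. <- NULL
names(toy)
#> [1] "gross_income"
```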

supermarket$Tax_5. <- NULL
head(supermarket)


Customizing the Dataset Type

Viewing data structure with str() before changing

str(supermarket)
#> 'data.frame':    1000 obs. of  16 variables:
#>  $ Invoice_ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
#>  $ Branch                 : chr  "A" "C" "A" "A" ...
#>  $ City                   : chr  "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
#>  $ Customer_type          : chr  "Member" "Normal" "Normal" "Member" ...
#>  $ Gender                 : chr  "Female" "Female" "Male" "Male" ...
#>  $ Product_line           : chr  "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
#>  $ Unit_price             : num  74.7 15.3 46.3 58.2 86.3 ...
#>  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
#>  $ Total                  : num  549 80.2 340.5 489 634.4 ...
#>  $ Date                   : chr  "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
#>  $ Time                   : chr  "13:08" "10:29" "13:23" "20:33" ...
#>  $ Payment                : chr  "Ewallet" "Cash" "Credit card" "Ewallet" ...
#>  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
#>  $ gross_margin_percentage: num  4.76 4.76 4.76 4.76 4.76 ...
#>  $ gross_income           : num  26.14 3.82 16.22 23.29 30.21 ...
#>  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...


Factor Type

First, we will change the data types of the Branch, City, Customer_type, Gender, Product_line, and Payment columns to factor, the type intended for categorical data. These columns contain repeated values, and storing them as factors saves memory because R does not treat each repeated value as a separate unique string. Moreover, converting them to factors makes statistical calculations and other operations easier.
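As an illustration of the memory point, the sketch below (using toy data, not the real dataset) compares a character vector of repeated city names with its factor equivalent:

```r
# Sketch: a factor stores integer codes plus one copy of each level,
# so repeated strings take less memory than a character vector.
city_chr <- rep(c("Yangon", "Naypyitaw", "Mandalay"), times = 1000)
city_fct <- as.factor(city_chr)

nlevels(city_fct)   # only 3 unique levels are stored
#> [1] 3

object.size(city_chr) > object.size(city_fct)
#> [1] TRUE
```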

Checking the unique values of these columns with the lapply() function, so that we do not have to inspect each column manually.

lapply(X = supermarket[, c("Branch", "City", 
                       "Customer_type", "Gender", 
                       "Product_line", "Payment")],
       FUN = unique)
#> $Branch
#> [1] "A" "C" "B"
#> 
#> $City
#> [1] "Yangon"    "Naypyitaw" "Mandalay" 
#> 
#> $Customer_type
#> [1] "Member" "Normal"
#> 
#> $Gender
#> [1] "Female" "Male"  
#> 
#> $Product_line
#> [1] "Health and beauty"      "Electronic accessories" "Home and lifestyle"    
#> [4] "Sports and travel"      "Food and beverages"     "Fashion accessories"   
#> 
#> $Payment
#> [1] "Ewallet"     "Cash"        "Credit card"

The process of converting into factor type.

supermarket[, c("Branch", "City", "Customer_type", 
                "Gender", "Product_line", "Payment")] <- # save the result of data type change
lapply(X = supermarket[, c("Branch", "City", "Customer_type",
                           "Gender", "Product_line", "Payment")],
       FUN = as.factor) # Changing data types at once with the lapply() function

Time Type

Second, we will change the Date and Time columns to date/time types. This makes it easier to analyze transactions based on time, and we will do it with the help of the lubridate and hms libraries.

We know the Date column has the format MM/DD/YYYY (month/day/year), so we will use the mdy() function from lubridate, and the Time column has the format hh:mm (hour:minute), so we will use the hms::parse_hm() function from the hms package.

supermarket$Date <- mdy(supermarket$Date)
supermarket$Time <- hms::parse_hm(supermarket$Time)


Viewing data structure with str() after changing

str(supermarket)
#> 'data.frame':    1000 obs. of  16 variables:
#>  $ Invoice_ID             : chr  "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
#>  $ Branch                 : Factor w/ 3 levels "A","B","C": 1 3 1 1 1 3 1 3 1 2 ...
#>  $ City                   : Factor w/ 3 levels "Mandalay","Naypyitaw",..: 3 2 3 3 3 2 3 2 3 1 ...
#>  $ Customer_type          : Factor w/ 2 levels "Member","Normal": 1 2 2 1 2 2 1 2 1 1 ...
#>  $ Gender                 : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
#>  $ Product_line           : Factor w/ 6 levels "Electronic accessories",..: 4 1 5 4 6 1 1 5 4 3 ...
#>  $ Unit_price             : num  74.7 15.3 46.3 58.2 86.3 ...
#>  $ Quantity               : int  7 5 7 8 7 7 6 10 2 3 ...
#>  $ Total                  : num  549 80.2 340.5 489 634.4 ...
#>  $ Date                   : Date, format: "2019-01-05" "2019-03-08" ...
#>  $ Time                   : 'hms' num  13:08:00 10:29:00 13:23:00 20:33:00 ...
#>   ..- attr(*, "units")= chr "secs"
#>  $ Payment                : Factor w/ 3 levels "Cash","Credit card",..: 3 1 2 3 3 3 3 3 2 2 ...
#>  $ cogs                   : num  522.8 76.4 324.3 465.8 604.2 ...
#>  $ gross_margin_percentage: num  4.76 4.76 4.76 4.76 4.76 ...
#>  $ gross_income           : num  26.14 3.82 16.22 23.29 30.21 ...
#>  $ Rating                 : num  9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
  • Our dataset has 1000 observations (rows) and 16 variables (columns).

  • The columns of our dataset now have appropriate data types:

    • Columns with many unique values stay character, such as Invoice_ID.
    • Columns with repeated values use factor, such as City and Customer_type.
    • Numeric columns use numeric types, such as Unit_price and Total.
    • Date-related columns use Date, such as Date.


Data Exploration and Analysis

Data Summary

summary(supermarket)
#>   Invoice_ID        Branch         City     Customer_type    Gender   
#>  Length:1000        A:340   Mandalay :332   Member:501    Female:501  
#>  Class :character   B:332   Naypyitaw:328   Normal:499    Male  :499  
#>  Mode  :character   C:328   Yangon   :340                             
#>                                                                       
#>                                                                       
#>                                                                       
#>                  Product_line   Unit_price       Quantity         Total        
#>  Electronic accessories:170   Min.   :10.08   Min.   : 1.00   Min.   :  10.68  
#>  Fashion accessories   :178   1st Qu.:32.88   1st Qu.: 3.00   1st Qu.: 124.42  
#>  Food and beverages    :174   Median :55.23   Median : 5.00   Median : 253.85  
#>  Health and beauty     :152   Mean   :55.67   Mean   : 5.51   Mean   : 322.97  
#>  Home and lifestyle    :160   3rd Qu.:77.94   3rd Qu.: 8.00   3rd Qu.: 471.35  
#>  Sports and travel     :166   Max.   :99.96   Max.   :10.00   Max.   :1042.65  
#>       Date                Time                 Payment         cogs       
#>  Min.   :2019-01-01   Length:1000       Cash       :344   Min.   : 10.17  
#>  1st Qu.:2019-01-24   Class1:hms        Credit card:311   1st Qu.:118.50  
#>  Median :2019-02-13   Class2:difftime   Ewallet    :345   Median :241.76  
#>  Mean   :2019-02-14   Mode  :numeric                      Mean   :307.59  
#>  3rd Qu.:2019-03-08                                       3rd Qu.:448.90  
#>  Max.   :2019-03-30                                       Max.   :993.00  
#>  gross_margin_percentage  gross_income         Rating      
#>  Min.   :4.762           Min.   : 0.5085   Min.   : 4.000  
#>  1st Qu.:4.762           1st Qu.: 5.9249   1st Qu.: 5.500  
#>  Median :4.762           Median :12.0880   Median : 7.000  
#>  Mean   :4.762           Mean   :15.3794   Mean   : 6.973  
#>  3rd Qu.:4.762           3rd Qu.:22.4453   3rd Qu.: 8.500  
#>  Max.   :4.762           Max.   :49.6500   Max.   :10.000

🔎 Summary

  1. We have 1000 transactions from 01-Jan-2019 to 30-Mar-2019.

  2. Most transactions occur in Branch A (340 transactions, or 34% of the total).

  3. The highest number of transactions is in the Fashion accessories product line (178 transactions, or 17.8% of the total) and the lowest is in Health and beauty (152 transactions, or 15.2%).

  4. Overall, unit prices range from 10.08 to 99.96 with an average of 55.67 (in dollars).

  5. The number of items purchased (Quantity) ranges from 1 to 10 units per transaction with an average of 5.51 (roughly 6 units).

  6. Total purchases (Total) per transaction range from 10.68 to 1,042.65 with a median of 253.85 (in dollars).

  7. Cost of goods sold (cogs) per transaction ranges from 10.17 to 993 with a median of 241.76 (in dollars).

  8. Gross income (gross_income) per transaction ranges from 0.5085 to 49.65 with a median of 12.09 (in dollars), and the gross margin percentage taken by the company is constant at 4.762%.

  9. The ratings across our 1000 transactions are quite good: 75% of transactions have a rating of at least 5.5 (the first quartile), and the median is 7 out of 10, meaning 50% of our data has a rating of 7 or higher.
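The constant 4.762% margin is no coincidence: with a flat 5% tax, gross income is 5% of cogs while the total paid is 105% of cogs, so the margin is 0.05 / 1.05 for every transaction regardless of its size. A quick arithmetic check (the cogs value is an arbitrary toy number, not taken from the data):

```r
# Sketch: with a flat 5% tax, gross_income = 0.05 * cogs and
# Total = 1.05 * cogs, so the margin is constant for any cogs.
cogs <- 100                      # arbitrary toy value
gross_income <- 0.05 * cogs
total <- 1.05 * cogs

margin_pct <- 100 * gross_income / total
round(margin_pct, 3)
#> [1] 4.762
```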


Price Distribution

box <- 
ggplot(data = supermarket, aes(x = Product_line, y = Unit_price)) +
  geom_boxplot(aes(fill = Product_line)) +
  theme(axis.text.x = element_blank()) +
  labs(title = "Price Distribution on Each Product Line",
       x = "Product Line",
       y = "Price",
       fill = "Product Line : ") +
  theme_dark() +
  theme(axis.text.x = element_blank())
  
ggplotly(box)

💡 Insight

The unit prices in every Product Line range from about 10 to 99 (in dollars). Based on the interquartile range of each boxplot:

  1. Electronic Accessories : the middle 50% of purchases fall between 27.85 and 75.91 (in dollars).

  2. Fashion Accessories : the middle 50% of purchases fall between 35.19 and 81.31 (in dollars).

  3. Food and Beverages : the middle 50% of purchases fall between 33.33 and 78.88 (in dollars).

  4. Health and Beauty : the middle 50% of purchases fall between 32.39 and 76.39 (in dollars).

  5. Home and Lifestyle : the middle 50% of purchases fall between 34.20 and 77.58 (in dollars).

  6. Sports and Travel : the middle 50% of purchases fall between 34.37 and 80.93 (in dollars).

🚀 Recommendation

When offering a new product or trying to increase sales quantity, we recommend considering the dominant price range that consumers pay in each product line.


Gender Distribution

# Data Preparation
gdr_produk <- 
as.data.frame(table(supermarket$Product_line, supermarket$Gender))
akm_t <- aggregate(Freq ~ Var1, data = gdr_produk, FUN = sum)

# visualization
bar <- 
ggplot(data = gdr_produk, aes(x = Freq, 
                              y = reorder(Var1, Freq),
                              text = paste("Value is", round(x = (Freq/akm_t$Freq)*100, digits = 2), "%"))) +
  geom_col(aes(fill = Var2), position = "fill") +
  scale_fill_brewer(palette = "Set2") +
   labs(title = "Gender Comparison in Each Product Line",
        x = "Proportion",
        y = NULL,
        fill = "Gender :")

ggplotly(bar, tooltip = "text")

💡 Insight

➔ Transactions in the Electronic Accessories and Home and Lifestyle product lines tend to have a fairly balanced gender proportion.

➔ Transactions in the Fashion Accessories, Food and Beverages, and Sports and Travel product lines tend to be dominated by Female consumers.

➔ Transactions in the Health and Beauty product line tend to be dominated by Male consumers.

📃 Essence

Each product line has its own gender distribution: some have a balanced proportion, some are dominated by Female consumers, and one by Male consumers.


Consumer Type Distribution

# Data Preparation
pd_cc <- 
as.data.frame(table(supermarket$Product_line, supermarket$Customer_type))

akm_t <- aggregate(Freq ~ Var1, data = pd_cc, FUN = sum)

# visualization
bar1 <- 
ggplot(data = pd_cc, aes(x = Freq, 
                         y = reorder(Var1, Freq),
                         text = paste("Value is", round(x = (Freq/akm_t$Freq)*100, digits = 2), "%"))) +
  geom_col(aes(fill = Var2), position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(title = "Comparison of Consumer Types in Each Product Line",
        x = "Proportion",
        y = NULL,
        fill = "Consumer Types :") 

ggplotly(bar1, tooltip = "text")

💡 Insight

➔ In the Fashion Accessories, Electronic Accessories, and Health and Beauty product lines, transactions are dominated by Normal type consumers.

➔ In the Food and Beverages, Sports and Travel, and Home and Lifestyle product lines, transactions are dominated by Member type consumers.

🚀 Recommendation

➔ We recommend encouraging Normal type consumers to become Members when they purchase products in the Fashion Accessories, Electronic Accessories, and Health and Beauty product lines.

➔ In addition, we should keep pushing the other three product lines to increase their proportion of Member type consumers.

➔ The hope is that a larger proportion of Member type consumers will increase loyalty and make it easier to promote new products. This can be done by providing attractive benefits for members, etc.


Distribution of Number of Transactions Per Hour

# Create hour column (Feature Engineering)
supermarket$hour <- hour(supermarket$Time)

# data preparation
tran_hour <- 
aggregate(Invoice_ID ~ hour + Product_line,
          data = supermarket,
          FUN = length)

# visualization
p <- 
ggplot(tran_hour, aes(x = hour, 
                      y = Product_line, 
                      text = glue(
                              "Product Line : {Product_line}
                              Hour : {hour}
                              Number of Transactions : {Invoice_ID}
                              "))) +
  geom_tile(aes(fill = Invoice_ID)) +
  labs(title = "Distribution of Transaction Amount from Each Product Line and Hour",
       x = "Hour",
       y = NULL,
       fill = "Range") +
  scale_fill_gradient(low="red", high="black") +
  scale_x_continuous(breaks = seq(0,24,1))

ggplotly(p, tooltip="text")

💡 Insight

The highest number of transactions is at 19:00 in the Food and Beverages product line (26 transactions), and the lowest is at 20:00 in the Sports and Travel product line (4 transactions).

➔ In Sports and Travel, many transactions occur at 13:00, 16:00, and 19:00.

➔ In Home and Lifestyle, many transactions occur at 11:00, 13:00, and 15:00.

➔ In Health and Beauty, many transactions occur at 10:00, 13:00, and 14:00.

➔ In Food and Beverages, many transactions occur at 10:00, 15:00, 19:00, and 20:00.

➔ In Fashion Accessories, many transactions occur at 12:00, 13:00, 18:00, and 19:00.

➔ In Electronic Accessories, many transactions occur at 10:00, 11:00, and 18:00.

🚀 Recommendation

➔ Knowing the dominant transaction hours, we can optimize total sales by increasing the quantity of goods sold during those hours, for example through promotions.

➔ We can also offer special promos during quieter hours to increase the number of transactions at those times.


Dominant and Minimal Product Line Purchased

# Data Preparation
vs <- as.data.frame(table(supermarket$Product_line, supermarket$Branch))

# visualization
ggplot(data = vs, mapping = aes(x = Var2, y = Freq)) +
  geom_col(aes(fill = Var1), position = "dodge") +
  geom_label(aes(label = Freq, group=Var1),
             position = position_dodge(width = 0.9),
             vjust = 1.2) +
  labs(title = "Number of Transactions by Product Line in Each Branch",
       subtitle = "Accumulated Transactions from Jan - Mar 2019",
       x = "Branch",
       y = "Number of Transactions",
       fill = "Product Line"
  ) +
   theme(plot.title.position = "plot")

💡 Insight

➔ The Health and Beauty product line tends to always occupy the bottom three positions in each branch.

➔ Branch A tends to be dominated by the Home and Lifestyle product line.

➔ Branch B tends to be dominated by the Fashion Accessories and Sports and Travel product lines.

➔ Branch C tends to be dominated by the Food and Beverages and Fashion Accessories product lines.

🚀 Recommendation

We recommend investigating why the Health and Beauty product line has so few transactions: is the marketing team carrying out its duties properly, is consumer interest low, or is it something else? If the problem lies with marketing, it can be fixed by, for example, increasing the effectiveness of promotions; if the problem is low consumer interest, consumers can be educated so that they become interested in buying Health and Beauty products.

Because the branches have different dominant product lines, the branch managers can collaborate with each other to increase overall sales in every branch.


Sales of Each Product Line

# data preparation
tl_agg <- aggregate(Total ~ Product_line, data = supermarket, FUN = sum)

# visualization
ggplot(data = tl_agg, aes(x = Total, y = reorder(Product_line, Total))) +
         geom_col(aes(fill = Total), show.legend = FALSE) +
         scale_fill_continuous(low = "red", high = "black") +
         geom_label(aes(label=Total), hjust = 0.8) +
         labs(title = "Total Sales of Each Product Line from the Entire Supermarket",
              subtitle = "Accumulated from Jan - Mar 2019",
              y = NULL,
              x = "Value of Total Sales") +
         scale_x_continuous(labels = scales::comma) +
         theme(plot.title.position = "plot") +
  theme(panel.background = element_rect(fill = '#F5F5F5', color = 'black'))

💡 Insight

➔ Overall, the Food and Beverages product line has the highest total sales and the Health and Beauty product line has the lowest.

➔ In a previous insight, we saw that the Fashion Accessories product line dominated the number of transactions in several branches, in line with the fact that it has the highest number of transactions among all product lines: 178, or 17.8% of the total. However, its sales value is only 4th out of 6, because the products purchased are relatively cheap: even though the quantity is large, the lower prices pull total sales down.
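This "many cheap transactions" reasoning can be illustrated with a small sketch (toy numbers, not the real data): a product line can lead on transaction count yet trail on total sales when its items are cheaper.

```r
# Sketch: more transactions does not imply higher total sales
# when unit prices are lower. Toy data, not the real dataset.
toy <- data.frame(
  Product_line = c("Fashion accessories", "Fashion accessories",
                   "Fashion accessories", "Food and beverages"),
  Total        = c(60, 90, 80, 400))

n_fashion <- sum(toy$Product_line == "Fashion accessories")  # 3 transactions
n_food    <- sum(toy$Product_line == "Food and beverages")   # 1 transaction

n_fashion > n_food                                           # more transactions...
#> [1] TRUE
sum(toy$Total[toy$Product_line == "Fashion accessories"]) <
  sum(toy$Total[toy$Product_line == "Food and beverages"])   # ...but lower sales
#> [1] TRUE
```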

🚀 Recommendation

➔ We recommend finding the cause of the Health and Beauty product line's low total sales and then finding a solution, as in the previous recommendation.

➔ We also need to increase the sales value of the Fashion Accessories product line.

➔ Overall, we need to maintain what is already working while continuing to improve every product line through better marketing, customer satisfaction, and so on.


Rating Distribution

# Visualization
hist1 <- ggplot(data = supermarket, aes(x = Rating, fill = Branch, 
                                  
                                          )) + 
   geom_histogram(bins = 4, 
                  color = "darkred"
                  ) +
   facet_wrap(~Branch) + 
   labs(title = "Distribution of Consumer Ratings of Each Branch",
        x = 'Rating',
        y = 'Frequency') +
  theme_dark() 

ggplotly(hist1)

💡 Insight

➔ All three branches (A, B, and C) tend to be dominated by consumers who give ratings of 5 - 9 out of 10.

➔ In Branch A, the numbers of consumers who gave ratings of 4 and 10 are similar.

➔ In Branch B, more consumers gave a rating of 4 than a rating of 10.

➔ In Branch C, fewer consumers gave a rating of 4 than a rating of 10.

🚀 Recommendation

We should increase customer satisfaction, which would be reflected in fewer ratings of 4 and more ratings of 10. Solutions can include being friendly to consumers, keeping the store clean, and so on.


Sales Trend

# Create week column (Feature Engineering)
supermarket$week <-  week(supermarket$Date)

# data preparation
total_trend <- aggregate(Total ~  week + Branch, data = supermarket, FUN = sum)


# visualization 
plt <- 
ggplot(data = total_trend, aes(x = week, y = round(Total, 2), group = Branch, col = Branch,
                                      text = paste("Branch ", Branch,
                                      "<br>Week - ", week,
                                      "<br>Sales: $", round(Total/1000,2),"K"))) +
  geom_line() +
  geom_point() +
  labs(
    title = "Trend of Total Sales From Each Branch",
    subtitle = "Period Jan - Mar 2019 (Week 1 - 13)",
    x = "Week",
    y = "Total sales (in $)") +
  scale_x_continuous(breaks = seq(1,13,1)) +
  scale_y_continuous(breaks = seq(0,12000,1000), labels = unit_format(scale = 1e-3, suffix = "K")) +
  scale_color_manual(values = c("black", "darkgrey", "red")) +
  theme(legend.position = "top",
        plot.title.position = "plot") 

ggplotly(plt, tooltip = "text")

💡 Insight

Note: K = thousand (e.g., 11.9K = 11,900).

➔ Branch A has a very fluctuating trend, with its highest value in week 3 (11.9K) and its lowest in week 13 (6.2K).

➔ Branch B has a very fluctuating trend, with its highest value in week 4 (11.97K) and its lowest in week 13 (5.08K).

➔ Branch C has a very fluctuating trend, with its highest value in week 6 (12.05K) and its lowest in week 13 (4.33K).

🚀 Recommendation

We should build a more stable income and strive for each week's sales to be higher than the previous week's.


Crowded Day of Consumers

# Create day column
supermarket$day <- wday(supermarket$Date, label = T, abbr = F)

# data preparation
agg_data <- aggregate(Invoice_ID ~ day + Branch, data = supermarket, FUN = length)

# visualization
plt <- 
ggplot(data = agg_data, aes(x = day, y = Invoice_ID, group = Branch, col = Branch,
                            text = glue(
                              "Branch : {Branch}
                              Day : {day}
                              Number of Transactions : {Invoice_ID}
                              ")
                            )) +
  geom_line() +
  geom_point() +
  labs(title = "Trend of Number of Transactions Per Day in Each Branch",
       x = "Day",
       y = "Number of Transactions")
ggplotly(plt, tooltip = "text")

💡 Insight

➔ Branch A has the highest number of transactions on Sunday and the lowest on Wednesday.

➔ Branch B has the highest number of transactions on Saturday and the lowest on Sunday.

➔ Branch C has the highest number of transactions on Tuesday and Saturday and the lowest on Monday and Friday.

📃 Essence

The chart above shows a recurring pattern: after a drastic drop in sales, transactions start to rise again the next day until they reach a peak, and then decline again.


Three Cities Performance

Based on 3 Indicators : Gross Income, Amount of Transactions and Quantity

# Collect columns containing the total quantity and gross income of each city
data_agg <- aggregate.data.frame(list(Quantity = supermarket$Quantity,
                                      Gross_Income = supermarket$gross_income),
                                by = list(city = supermarket$City),
                                FUN = sum)

# Collect the transaction amount of each city
transaction <- aggregate(supermarket$City,
                       by = list(supermarket$City), 
                       FUN = length)

# Get transaction amount column for data_agg
data_agg$Number_of_Transactions <- transaction$x

# transform data wide -> long
city_pivot <- pivot_longer(data = data_agg, 
                           cols = c("Quantity", "Gross_Income","Number_of_Transactions"))
                             
# Visualization
ggplot(data = city_pivot, aes(x = city, y = value)) +
  geom_col(aes(fill = city), position = "dodge") +
  facet_wrap(~name, scales = "free_y") +
  labs(title = "Three Cities Performance Based on 3 Indicators",
       subtitle = "From January to March 2019",
       x = NULL,
       y = "Value",
       fill = "City :") +
  theme(axis.text.x = element_blank())

💡 Insight

➔ With the Gross Income indicator, the city of Naypyitaw tends to have the highest value.

➔ With the Number of Transactions indicator, the city of Yangon tends to have the highest value.

➔ With the Quantity indicator, the city of Yangon tends to have the highest value.

📃 Essence

The city of Yangon is superior in two of the three indicators, but the city of Naypyitaw tends to be the most profitable. So the best-performing city can be Yangon or Naypyitaw, depending on which consideration you want to prioritize.


The Combined Performance of All Supermarkets

# Create month column
supermarket$month <- month(supermarket$Date, label = T, abbr = F)

# data preparation
df_agg <- aggregate.data.frame(list(Total= supermarket$Total,
                          COGS = supermarket$cogs,
                          Gross_Income = supermarket$gross_income),
                     by = list(month = supermarket$month), 
                     FUN = sum)

df_agg <- pivot_longer(data = df_agg,
                       cols = c("Total", "COGS", "Gross_Income"),
                       names_to = "component")


# Visualization
plot_col <- 
ggplot(data = df_agg, aes(x = month, y = value,
                          text = paste("Month :", month, "\n",
                                       "Component :", component,"\n",
                                       "Value $ :", round(value/1000,2), "K"))) +
  geom_col(aes(fill = component), position = "dodge") +
  labs(title = "Trend of Total, COGS and Gross Income\n From Jan - Mar 2019",
       x = NULL,
       y = "Value $") +
   scale_y_continuous(labels = unit_format(scale = 1e-3, suffix = "K")) +
   theme_minimal()
  

  
ggplotly(plot_col, tooltip = "text") 

💡 Insight

➔ The total sales of our supermarkets decreased in February 2019 compared to January 2019. However, the following month (March 2019) showed a recovery, with total sales increasing although not yet returning to January 2019 levels.

➔ This pattern of total sales is mirrored by the cost of goods sold (COGS) and gross income, which also decreased in February 2019 compared to January 2019 and recovered in March 2019, though not yet to January's figures.

🚀 Recommendation

We must encourage sales growth every month so that gross income increases. Moreover, we must seek efficiency in the cost of goods sold (COGS) or increase the gross margin percentage per sale in order to raise gross income.



Conclusion

➔ In this project, various things have been reviewed, starting from the Introduction, which presents the background of what is discussed.

➔ We then covered data preparation and wrangling: loading the data and adjusting it so that it can be processed.

➔ After that, we explored and analyzed the data, describing the transactions (gender and type of consumer, prices of goods, etc.). We also analyzed supermarket performance based on Rating, Product line, Branch, Total sales, COGS, and Gross Income, and with respect to time, such as which hours and weeks have the most transactions.

➔ For each graph, we drew out the related insights, recommendations, or essence.

Overall, the supermarkets discussed in this project perform well, as seen from the positive Gross Income, the diverse product lines, and good customer satisfaction (ratings are dominated by high scores). However, they still need to maintain stable growth in Gross Income and other measures, and to boost the performance of the Health and Beauty product line, which sits in the lowest position compared to the others. I rate the overall performance of this supermarket 8 out of 10.